Take Home Ex 3

Published

March 11, 2023

Modified

March 21, 2023

Getting Started

In this take-home exercise, you are tasked to predict HDB resale prices at the sub-market level (i.e.HDB 3-room, HDB 4-room and HDB 5-room) for the month of January and February 2023 in Singapore. The predictive models must be built by using by using conventional OLS method and GWR methods. You are also required to compare the performance of the conventional OLS method versus the geographical weighted methods.

The study should focus on either three-room, four-room or five-room flat and transaction period should be from 1st January 2021 to 31st December 2022. The test data should be January and February 2023 resale prices.

The Data

Data sets will be used in this model building exercise, they are:

  1. MP 2014 Subzone Boundary

  2. Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020

  3. HDB Resale Price

Loading of packages

pacman::p_load(olsrr, corrplot, ggpubr, sf, spdep, GWmodel, tmap, tidyverse, gtsummary,mapview,leaflet,RColorBrewer,tidygeocoder,jsonlite,httr,onemapsgapi,reporter,magrittr,readxl,gdata, units,matrixStats, SpatialML,rsample, Metrics)

Importing Geospatial Data

After I imported the subzone layer and analysed below, I’ve found out that there are some subzone area which can be removed because the area are mostly industrial/campsite and would not have any residents living there.

mpsz19 <- st_read(dsn = "data/geospatial", layer = "MPSZ-2019") %>% 
  st_transform(3414) %>% filter(SUBZONE_N != "NORTH-EASTERN ISLANDS" & SUBZONE_N != "SOUTHERN GROUP"& SUBZONE_N != "SEMAKAU"& SUBZONE_N != "SUDONG")
Reading layer `MPSZ-2019' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
st_crs(mpsz19)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_bbox(mpsz19)
     xmin      ymin      xmax      ymax 
 2667.538 21448.473 55941.942 50256.334 

After plotting a map, we can see that there is an island called NORTH-EASTERN ISLAND, SOUTHERN GROUP, SEMAKAU and SUDONG which is out of analysis area. we should remove them for the purpose of this takehome exercise. We can re-import the data with out this names by doing a filter(). Re run the map and look! its clean now. We can move on to our analysis.

Population Analysis

Importing Aspatial Data Population

In this section, I will be importing Singapore population data. In our Hands-on Ex03 we have use a choropleth map to portray the spatial distribution of aged population of Singapore by Master Plan 2014 Subzone Boundary. Let’s further analyse this data and code further for our analysis.

popdata <- read_csv("data/aspatial/population_2011to2020.csv")
popdata2020 <- popdata %>%
  filter(Time == 2020) %>%
  group_by(PA, SZ, AG) %>%
  summarise(`POP` = sum(`Pop`)) %>%
  ungroup()%>%
  pivot_wider(names_from=AG, 
              values_from=POP) %>%
  mutate(YOUNG = rowSums(.[3:6])
         +rowSums(.[12])) %>%
mutate(`ECONOMY ACTIVE` = rowSums(.[7:11])+
rowSums(.[13:15]))%>%
mutate(`AGED`=rowSums(.[16:21])) %>%
mutate(`TOTAL`=rowSums(.[3:21])) %>%  
mutate(`DEPENDENCY` = (`YOUNG` + `AGED`)
/`ECONOMY ACTIVE`) %>%
  select(`PA`, `SZ`, `YOUNG`, 
       `ECONOMY ACTIVE`, `AGED`, 
       `TOTAL`, `DEPENDENCY`)

As we look at the raw data, we can create a new dataframe to combine the number of population and group by PA and SZ.

Joining the attribute data and geospatial data

For master plan and Pop data frame, there are 2 common variable which we can join them together. PLN_AREA_N = PA & SUBZONE_N = SZ but we have to convert them to upper case first

popdata2020 <- popdata2020 %>%
  mutate_at(.vars = vars(PA, SZ), 
          .funs = funs(toupper)) %>%
  filter(`ECONOMY ACTIVE` > 0)

left join both master plan and population data by their sub zone level.

mpsz_pop2020 <- left_join(mpsz19, popdata2020,
                          by = c("SUBZONE_N" = "SZ"))

Exploratory Data Analysis (EDA)

tm_shape(mpsz_pop2020)+
  tm_fill("DEPENDENCY",
          style = "quantile",
          palette = "Greens") +
  tm_borders(alpha = 0.5)

tmap_options(check.and.fix = TRUE)
tm_shape(mpsz_pop2020)+
  tm_fill("TOTAL",
          style = "quantile",
          palette = "Blues") +
  tm_borders(alpha = 0.5)

Based on the chart above, we can see that in general, the east area is more dense as compared to the other region. In our handson 3, we compare the young and age, but for analyzing the resale price, it would be more useful to compare economic active and aged group instead young and aged group. Because these 2 age group is more appropriate to get housing as compared to the young.

Lets compare between the Active and Aged groups

tmap_mode("view")
tmap_options(check.and.fix = TRUE)
activemap <- tm_shape(mpsz_pop2020)+ 
  tm_polygons("ECONOMY ACTIVE", 
              style = "quantile", 
              palette = "Blues",
              )+ 
  tm_view(set.zoom.limits = c(9,14))

agedmap <- tm_shape(mpsz_pop2020)+ 
  tm_polygons("AGED", 
              style = "quantile", 
              palette = "Oranges") +
  tm_view(set.zoom.limits = c(9,14))

tmap_arrange(activemap, agedmap, asp=1, ncol=2, sync = TRUE)
tmap_mode("plot")

Based on the interactive graph above, we can say that the east area has a more dense aged group as compared to the economy active group. In general, I can also say that; the east area has a higher aging population as compared to other areas. Furthermore, we can also conclude that there are more resident staying in the east side than other region.

Existing Dwelling Analysis

Notice that in the excel sheet, there is a column called dwelling as shown below. We can also further analysis the current type of dwelling to narrow down which type of housing are most common/popular.

dwelling2020 <- popdata %>%
  filter(Time == 2020 & 
           TOD == "HDB 3-Room Flats" |
           TOD == "HDB 4-Room Flats" |
           TOD == "HDB 5-Room and Executive Flats") %>%
  group_by(PA, SZ, TOD) %>%
  summarise(`POP` = sum(`Pop`))

Joining the attribute data and geospatial data

dwelling2020 <- dwelling2020 %>%
  mutate_at(.vars = vars(PA, SZ), 
          .funs = funs(toupper)) %>%
  filter(`POP` > 0)
mpsz_dwelling2020 <- left_join(mpsz19, dwelling2020,
                          by = c("SUBZONE_N" = "SZ"))

Exploratory Data Analysis (EDA)

 tmap_mode("view")
HDB3room <- tm_shape(mpsz_dwelling2020 %>% filter( TOD == "HDB 3-Room Flats"))+
  tm_fill("POP",
          style = "quantile",
          palette = "Greens") +
  tm_view(set.zoom.limits = c(10,14))

HDB4room <- tm_shape(mpsz_dwelling2020 %>% filter( TOD == "HDB 4-Room Flats"))+ 
  tm_polygons("POP", 
              style = "quantile", 
              palette = "Blues",
              )+ 
  tm_view(set.zoom.limits = c(10,14))

HDB5room <- tm_shape(mpsz_dwelling2020 %>% filter( TOD == "HDB 5-Room and Executive Flats"))+ 
  tm_polygons("POP", 
              style = "quantile", 
              palette = "Reds") +
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(HDB3room, HDB4room, HDB5room, asp=1, ncol=3, sync = TRUE)

Mentioned above that the study should focus on either three-room, four-room or five-room flat. Based on the chart above, we can see that 4-room flat has a higher results as compared to 3-room and 5-room due to its scale of 536,920 for 4-room, next highest is 495,380 for 5-room and 30,880 for 3-room. With this dwelling analysis, I will be focusing on 4-Room flat cause of its popularity in Singapore.

Importing Locational factors

The data mostly are taken from data.gov or LTA data mall factors include:

  1. Bus stop location
  2. Eldercare
  3. Foodarea
  4. schools
  5. childcare
  6. train station kaggel
  7. malls from kaggle
  8. supermarket
  9. hospital (collated manually) SGhospital
  10. primary schools
  11. good primary schools
    • As there are many views about elite schools in Singapore as attending an elite primary school in Singapore is important depends on an individual’s unique circumstances and goals. In this take home exercise, I will be picking the top 10 most common popular primary schools. These are the top 10 popular schools according to this website.
busstops <- st_read(dsn = "data/aspatial/BusStopLocation", layer = "BusStop")
Reading layer `BusStop' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial\BusStopLocation' 
  using driver `ESRI Shapefile'
Simple feature collection with 5159 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
eldercare <- st_read(dsn = "data/aspatial/eldercare-services-shp", layer = "ELDERCARE")
Reading layer `ELDERCARE' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial\eldercare-services-shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
poi <- st_read(dsn = "data/aspatial/poi_singapore", layer = "Singapore_POIS")
Reading layer `Singapore_POIS' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial\poi_singapore' 
  using driver `ESRI Shapefile'
Simple feature collection with 18687 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3699.08 ymin: 22453.07 xmax: 52883.86 ymax: 50153.8
Projected CRS: SVY21 / Singapore TM

After we see the poi datafame, lets check for unique values under fclass

unique(poi$fclass)
  [1] "bakery"              "viewpoint"           "atm"                
  [4] "attraction"          "toilet"              "fast_food"          
  [7] "convenience"         "supermarket"         "food_court"         
 [10] "bank"                "post_office"         "restaurant"         
 [13] "cafe"                "monument"            "shelter"            
 [16] "tourist_info"        "bench"               "college"            
 [19] "telephone"           "memorial"            "bar"                
 [22] "pub"                 "pharmacy"            "cinema"             
 [25] "theatre"             "department_store"    "school"             
 [28] "playground"          "water_tower"         "fire_station"       
 [31] "vending_any"         "dentist"             "post_box"           
 [34] "stadium"             "hospital"            "police"             
 [37] "recycling"           "artwork"             "camera_surveillance"
 [40] "library"             "car_dealership"      "laundry"            
 [43] "observation_tower"   "lighthouse"          "sports_centre"      
 [46] "fountain"            "gift_shop"           "doityourself"       
 [49] "toy_shop"            "florist"             "hairdresser"        
 [52] "clothes"             "beverages"           "jeweller"           
 [55] "vending_machine"     "kindergarten"        "greengrocer"        
 [58] "butcher"             "battlefield"         "museum"             
 [61] "waste_basket"        "pitch"               "community_centre"   
 [64] "drinking_water"      "clinic"              "market_place"       
 [67] "general"             "picnic_site"         "bookshop"           
 [70] "travel_agent"        "camp_site"           "embassy"            
 [73] "town_hall"           "doctors"             "bicycle_rental"     
 [76] "zoo"                 "bicycle_shop"        "comms_tower"        
 [79] "tower"               "guesthouse"          "nightclub"          
 [82] "university"          "computer_shop"       "sports_shop"        
 [85] "arts_centre"         "swimming_pool"       "furniture_shop"     
 [88] "theme_park"          "veterinary"          "beauty_shop"        
 [91] "outdoor_shop"        "video_shop"          "ice_rink"           
 [94] "motel"               "public_building"     "car_rental"         
 [97] "stationery"          "newsagent"           "chemist"            
[100] "mobile_phone_shop"   "ruins"               "hotel"              
[103] "park"                "shoe_shop"           "optician"           
[106] "mall"                "hostel"              "car_wash"           
[109] "kiosk"               "chalet"              "water_well"         
[112] "prison"              "wayside_shrine"      "recycling_paper"    
[115] "wayside_cross"       "car_sharing"         "garden_centre"      
[118] "recycling_glass"     "dog_park"           

We notice that there are a lot of category, lets only select supermarket for our analysis as there are some inaccuracy in certain classes.

supermarkets <- st_read(dsn = "data/aspatial/poi_singapore", layer = "Singapore_POIS") %>% 
  filter(fclass == "supermarket")
Reading layer `Singapore_POIS' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial\poi_singapore' 
  using driver `ESRI Shapefile'
Simple feature collection with 18687 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3699.08 ymin: 22453.07 xmax: 52883.86 ymax: 50153.8
Projected CRS: SVY21 / Singapore TM
foodarea <- st_read(dsn = "data/aspatial/poi_singapore", layer = "Singapore_POIS") %>% 
  filter(fclass == "food_court"| fclass == "fast_food"
         |fclass =="restaurant" |fclass =="cafe")
Reading layer `Singapore_POIS' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial\poi_singapore' 
  using driver `ESRI Shapefile'
Simple feature collection with 18687 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3699.08 ymin: 22453.07 xmax: 52883.86 ymax: 50153.8
Projected CRS: SVY21 / Singapore TM
trainstation <- read_csv("data/aspatial/mrt_lrt_data.csv")
childcare <- st_read(dsn = "data/aspatial", layer = "ChildcareServices")
Reading layer `ChildcareServices' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 1545 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 11203.01 ymin: 25667.6 xmax: 45404.24 ymax: 49300.88
z_range:       zmin: 0 zmax: 0
Projected CRS: SVY21 / Singapore TM
hospital <- read_excel("data/aspatial/hospital.xlsx")
good_prischool <- read_csv("data/aspatial/school-directory-and-information/general-information-of-schools.csv") %>% filter(
  school_name == "NANYANG PRIMARY SCHOOL" | 
    school_name == "CATHOLIC HIGH SCHOOL" |
    school_name == "TAO NAN SCHOOL" |
    school_name == "NAN HUA PRIMARY SCHOOL" |
    school_name == "ST. HILDA'S PRIMARY SCHOOL" |
    school_name == "HENRY PARK PRIMARY SCHOOL" |
    school_name == "ANGLO-CHINESE SCHOOL (PRIMARY)" |
    school_name == "RAFFLES GIRLS' PRIMARY SCHOOL" |
    school_name == "PEI HWA PRESBYTERIAN PRIMARY SCHOOL" |
     school_name == "CHIJ ST. NICHOLAS GIRLS' SCHOOL")
primaryschool <- read_csv("data/aspatial/school-directory-and-information/general-information-of-schools.csv") %>% filter(mainlevel_code =="PRIMARY")
shoppingmalls <- read_csv("data/aspatial/shopping_mall_coordinates.csv")

Change dataframe to sf format

busstop.sf <- st_transform(busstops, crs=3414)
eldercare.sf <- st_transform(eldercare, crs=3414)
foodarea.sf <- st_transform(foodarea, crs=3414)
supermarkets.sf <- st_transform(supermarkets, crs=3414)
childcare.sf <- st_transform(childcare, crs=3414)
hospital.sf <- st_as_sf(hospital,
                            coords = c("Long", "Lat"),
                            crs=4326) %>%
  st_transform(crs=3414)
trainstation.sf <- st_as_sf(trainstation,
                            coords = c("lng", "lat"),
                            crs=4326) %>%
  st_transform(crs=3414)

We note that there are no Lat and Long in the primary school data frame. we need to tidy the data for both primary and good primary schools. To prevent it from rendering the website for too long, I will be exporting and importing the tidied files.

# primaryschool[c('block', 'street')] <- str_split_fixed(primaryschool$address, ' ', # 2)
# primaryschool$street<- trim(primaryschool$street)
# primaryschool$street<- toupper(primaryschool$street)
# good_prischool[c('block', 'street')] <- str_split_fixed(good_prischool$address, ' '# , 2)
# good_prischool$street<- trim(good_prischool$street)
# good_prischool$street<- toupper(good_prischool$street)
#library(httr)
#geocode <- function(block, streetname) {
#  base_url <- "https://developers.onemap.sg/commonapi/search"
#  address <- paste(block, streetname, sep = " ")
#  query <- list("searchVal" = address, 
#                "returnGeom" = "Y",
#                "getAddrDetails" = "N",
#                "pageNum" = "1")
#  
#  res <- GET(base_url, query = query)
#  restext<-content(res, as="text")
#  
#  output <- fromJSON(restext)  %>% 
#    as.data.frame %>%
#    select(results.LATITUDE, results.LONGITUDE)
#
#  return(output)
#}
#good_prischool$LATITUDE <- 0
#good_prischool$LONGITUDE <- 0
#
#for (i in 1:nrow(good_prischool)){
#  temp_output <- geocode(good_prischool[i, 32], good_prischool[i, 33])
#  
#  good_prischool$LATITUDE[i] <- temp_output$results.LATITUDE
#  good_prischool$LONGITUDE[i] <- temp_output$results.LONGITUDE
#}
# primaryschool$LATITUDE <- 0
# primaryschool$LONGITUDE <- 0
# 
# for (i in 1:nrow(primaryschool)){
#   temp_output <- geocode(primaryschool[i, 32], primaryschool[i, 33])
#   
#   primaryschool$LATITUDE[i] <- temp_output$results.LATITUDE
#   primaryschool$LONGITUDE[i] <- temp_output$results.LONGITUDE
# }

Export tidy data for schools

#write.csv(primaryschool,"data/exported/primaryschool.csv")
#write.csv(good_prischool,"data/exported/good_prischool.csv")

Import tidy data for schools

primaryschool <- read_csv("data/exported/primaryschool.csv")
good_prischool <- read_csv("data/exported/good_prischool.csv")

After we have done, we can proceed to convert the df to sf.

shoppingmalls.sf <- st_as_sf(shoppingmalls,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)

good_prischool.sf <- st_as_sf(good_prischool,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)

primaryschool.sf <- st_as_sf(primaryschool,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)

Exploratory Data Analysis (EDA)

tmap_mode("plot")

PLOT_BUS <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(busstop.sf) +
  tm_dots(col="red", size=0.05) +
  tm_layout(main.title = "Bus Stops",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))


PLOT_TRAIN <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(trainstation.sf) +
  tm_dots(col="blue", size=0.05) +
  tm_layout(main.title = "Train Station",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(PLOT_BUS, PLOT_TRAIN, 
             asp=1, ncol=2,
             sync = FALSE)

Notice for bus stop, there are a few points that is out of Singapore, lets remove those points namely “LARKIN TER”, “KOTARAYA II TER”, “JOHOR BAHRU CHECKPT”, “JB SENTRAL”. After that, we can re-run the map above.

busstop.sf <- busstop.sf  %>%
  filter(LOC_DESC != "JOHOR BAHRU CHECKPT" & LOC_DESC != "LARKIN TER"& LOC_DESC != "KOTARAYA II TER"& LOC_DESC != "JB SENTRAL")

Lets looks at other data.

tmap_mode("plot")

PLOT_CHILD <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(childcare.sf) +
  tm_dots(col="orange", size=0.05) +
  tm_layout(main.title = "Child Care",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))


PLOT_ELDER <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(eldercare.sf) +
  tm_dots(col="green", size=0.05) +
  tm_layout(main.title = "Elder Care",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(PLOT_CHILD, PLOT_ELDER, 
             asp=1, ncol=2,
             sync = FALSE)

tmap_mode("plot")

PLOT_PRISCHOOL <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(primaryschool.sf) +
  tm_dots(col="#009999", size=0.05) +
  tm_layout(main.title = "Primary school",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))


PLOT_GOODPRISCHOOL <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(good_prischool.sf) +
  tm_dots(col="red", size=0.05) +
  tm_layout(main.title = "Good Primary School",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(PLOT_PRISCHOOL, PLOT_GOODPRISCHOOL, 
             asp=1, ncol=2,
             sync = FALSE)

tmap_mode("plot")
PLOT_FOOD <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(foodarea.sf) +
  tm_dots(col="#009999", size=0.05) +
  tm_layout(main.title = "Food Area",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))


PLOT_SUPERMART <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(supermarkets.sf) +
  tm_dots(col="#0000FF", size=0.05) +
  tm_layout(main.title = "Supermarket",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(PLOT_FOOD, PLOT_SUPERMART, 
             asp=1, ncol=2,
             sync = FALSE)

After plotting, we saw a point which is not within SG, let’s remove them and re run the map.

foodarea.sf <- foodarea.sf  %>%
  filter(osm_id  != "4493618264")
tmap_mode("plot")

PLOT_HOST <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(hospital.sf) +
  tm_dots(col="red", size=0.05) +
  tm_layout(main.title = "Hospital",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))


PLOT_MALL <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(shoppingmalls.sf) +
  tm_dots(col="orange", size=0.05) +
  tm_layout(main.title = "Shopping mall",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(PLOT_HOST, PLOT_MALL, 
             asp=1, ncol=2,
             sync = FALSE)

Looks like all data points are within Singapore! Next, we will be preparing data for the HDB resale price.

Importing Aspatial Data HDB resale price

We will import the HDB resale prices and also filter the data to 4-room flat from 1st January 2021 to 31st December 2022 fro training purposes

hdb_resale <- read_csv("data/aspatial/resale_flat_price_full.csv")  %>% 
  filter(flat_type == "4 ROOM") %>%
  filter(month >= "2021-01" & month <= "2022-12")
glimpse(hdb_resale)
Rows: 23,656
Columns: 11
$ month               <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021-…
$ town                <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type           <chr> "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", …
$ block               <chr> "547", "414", "509", "467", "571", "134", "204", "…
$ street_name         <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 10", "ANG MO …
$ storey_range        <chr> "04 TO 06", "01 TO 03", "01 TO 03", "07 TO 09", "0…
$ floor_area_sqm      <dbl> 92, 92, 91, 92, 92, 98, 92, 92, 92, 92, 92, 109, 9…
$ flat_model          <chr> "New Generation", "New Generation", "New Generatio…
$ lease_commence_date <dbl> 1981, 1979, 1980, 1979, 1979, 1978, 1977, 1978, 19…
$ remaining_lease     <chr> "59 years", "57 years 09 months", "58 years 06 mon…
$ resale_price        <dbl> 370000, 375000, 380000, 385000, 410000, 410000, 41…

Next, we will be importing test data. The test data should be January and February 2023 resale prices. We will be filtering January to Feb 2023 and 4-room flat.

hdb_resale_test <- read_csv("data/aspatial/resale_flat_price_full.csv")  %>% 
  filter(flat_type == "4 ROOM") %>%
  filter(month >= "2023-01" & month <= "2023-02")

In the below data prep, we should clean the data for both test and train.

Data Preparation

Since there is no Long and Lat in the data, we need to create one. But before that we can combine the block with the street name, create category column representing story range, create Lat Long column and reformat remaining lease.

  1. Combine block and street name
hdb_resale$address <-  paste(hdb_resale$block, hdb_resale$street_name, sep=" ")
hdb_resale_test$address <-  paste(hdb_resale_test$block, hdb_resale_test$street_name, sep=" ")
  1. Looking at megan’s Takehome, she created duplicate values representing the story range. I was thinking of something different, so I decided we can categories in 4 category Low, Mid, High, very High instead.
  • Low: 01-06

  • Middle: 07-12

  • High: 13-24

  • Very High: >= 25

unique(hdb_resale$storey_range)
 [1] "04 TO 06" "01 TO 03" "07 TO 09" "10 TO 12" "13 TO 15" "16 TO 18"
 [7] "19 TO 21" "22 TO 24" "28 TO 30" "25 TO 27" "31 TO 33" "43 TO 45"
[13] "34 TO 36" "37 TO 39" "40 TO 42" "46 TO 48" "49 TO 51"

Low

hdb_resale$story_level_low <- ifelse(hdb_resale$storey_range=="01 TO 03"|hdb_resale$storey_range=="04 TO 06", 1, 0)
hdb_resale_test$story_level_low <- ifelse(hdb_resale_test$storey_range=="01 TO 03"|hdb_resale_test$storey_range=="04 TO 06", 1, 0)

Middle

hdb_resale$story_level_mid <- ifelse(hdb_resale$storey_range=="07 TO 09"|hdb_resale$storey_range=="10 TO 12", 1, 0)
hdb_resale_test$story_level_mid <- ifelse(hdb_resale_test$storey_range=="07 TO 09"|hdb_resale_test$storey_range=="10 TO 12", 1, 0)

High

hdb_resale$story_level_high <- ifelse(hdb_resale$storey_range=="13 TO 15"|hdb_resale$storey_range=="16 TO 18"|hdb_resale$storey_range=="19 TO 21"|hdb_resale$storey_range=="22 TO 24", 1, 0)
hdb_resale_test$story_level_high <- ifelse(hdb_resale_test$storey_range=="13 TO 15"|hdb_resale_test$storey_range=="16 TO 18"|hdb_resale_test$storey_range=="19 TO 21"|hdb_resale_test$storey_range=="22 TO 24", 1, 0)

Very High

hdb_resale$story_level_veryhigh <- ifelse(hdb_resale$storey_range>="25 TO 27", 1, 0)
hdb_resale_test$story_level_veryhigh <- ifelse(hdb_resale_test$storey_range>="25 TO 27", 1, 0)
  1. Create Lat Long column

The below code, i have copied the from our senior Megan website. The below code extracts the Latitude and Longitude of the address through OneMap API. I have commented after I ran the code once, the process took me about 1 hour to finish extracting the longlat.

I will be exporting it into XLSX file and read in the file.

#library(httr)
#geocode <- function(block, streetname) {
#  base_url <- "https://developers.onemap.sg/commonapi/search"
#  address <- paste(block, streetname, sep = " ")
#  query <- list("searchVal" = address, 
#                "returnGeom" = "Y",
#                "getAddrDetails" = "N",
#                "pageNum" = "1")
#  
#  res <- GET(base_url, query = query)
#  restext<-content(res, as="text")
#  
#  output <- fromJSON(restext)  %>% 
#    as.data.frame %>%
#    select(results.LATITUDE, results.LONGITUDE)
#
#  return(output)
#}
#hdb_resale$LATITUDE <- 0
#hdb_resale$LONGITUDE <- 0
#
#for (i in 1:nrow(hdb_resale)){
#  temp_output <- geocode(hdb_resale[i, 4], hdb_resale[i, 5])
#  
#  hdb_resale$LATITUDE[i] <- temp_output$results.LATITUDE
#  hdb_resale$LONGITUDE[i] <- temp_output$results.LONGITUDE
#}
#hdb_resale_test$LATITUDE <- 0
#hdb_resale_test$LONGITUDE <- 0
#
#for (i in 1:nrow(hdb_resale_test)){
#  temp_output <- geocode(hdb_resale_test[i, 4], hdb_resale_test[i,5])
#  
#  hdb_resale_test$LATITUDE[i] <- temp_output$results.LATITUDE
#  hdb_resale_test$LONGITUDE[i] <- temp_output$results.LONGITUDE
#}
#write.csv(hdb_resale,"data/exported/hdb_resale_latlong.csv")
#write.csv(hdb_resale_test,"data/exported/hdb_resale_test_latlong.csv")

Read in the hdb resale price with latlong column.

hdb_resale <- read_csv("data/exported/hdb_resale_latlong.csv")
hdb_resale_test <- read_csv("data/exported/hdb_resale_test_latlong.csv")

Lets check for any NA values

sum(is.na(hdb_resale$LATITUDE))
[1] 0
sum(is.na(hdb_resale$LONGITUDE))
[1] 0
sum(is.na(hdb_resale_test$LATITUDE))
[1] 0
sum(is.na(hdb_resale_test$LONGITUDE))
[1] 0

There is no missing value and we can proceed to the next steps.

  1. Convert Remaining lease format
str_list <- str_split(hdb_resale$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      hdb_resale$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    hdb_resale$remaining_lease[i] <- year
  }
}
str_list <- str_split(hdb_resale_test$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      hdb_resale_test$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    hdb_resale_test$remaining_lease[i] <- year
  }
}

Converting aspatial data dataframe into a sf object.

Currently, the hdb_resale tibble data frame is aspatial. We will convert it to a sf object and the output will be in point feature form. The code chunk below converts hdb_resale data frame into a simple feature data frame by using st_as_sf() of sf packages.

hdb_resale.sf <- st_as_sf(hdb_resale,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)

hdb_resale_test.sf <- st_as_sf(hdb_resale_test,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)
head(hdb_resale.sf)
Simple feature collection with 6 features and 17 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 28960.32 ymin: 38439.17 xmax: 30770.07 ymax: 39578.64
Projected CRS: SVY21 / Singapore TM
# A tibble: 6 × 18
   ...1 month   town       flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶
  <dbl> <chr>   <chr>      <chr>   <chr> <chr>   <chr>     <dbl> <chr>     <dbl>
1     1 2021-01 ANG MO KIO 4 ROOM  547   ANG MO… 04 TO …      92 New Ge…    1981
2     2 2021-01 ANG MO KIO 4 ROOM  414   ANG MO… 01 TO …      92 New Ge…    1979
3     3 2021-01 ANG MO KIO 4 ROOM  509   ANG MO… 01 TO …      91 New Ge…    1980
4     4 2021-01 ANG MO KIO 4 ROOM  467   ANG MO… 07 TO …      92 New Ge…    1979
5     5 2021-01 ANG MO KIO 4 ROOM  571   ANG MO… 07 TO …      92 New Ge…    1979
6     6 2021-01 ANG MO KIO 4 ROOM  134   ANG MO… 07 TO …      98 New Ge…    1978
# … with 8 more variables: remaining_lease <chr>, resale_price <dbl>,
#   address <chr>, story_level_low <dbl>, story_level_mid <dbl>,
#   story_level_high <dbl>, story_level_veryhigh <dbl>, geometry <POINT [m]>,
#   and abbreviated variable names ¹​flat_type, ²​street_name, ³​storey_range,
#   ⁴​floor_area_sqm, ⁵​flat_model, ⁶​lease_commence_date

Proximity Distance Calculation

Moving on, we need to find is the proximity to particular facilities - which we can compute with st_distance(), and find the closest facility (shortest distance) with the rowMins() function of our matrixStats package. The values will be appended to the data frame as a new column. (the below code will be credited to our senior Megan)

#library(units)
#library(matrixStats)
#proximity <- function(df1, df2, varname) {
#  dist_matrix <- st_distance(df1, df2) %>%
#    drop_units()
#  df1[,varname] <- rowMins(dist_matrix)
#  return(df1)
#}
#hdb_resale_train.sf <- 
#  # the columns will be truncated later on when viewing 
#  # so we're limiting ourselves to two-character columns for ease of #viewing #etween
#  proximity(hdb_resale.sf, busstop.sf, "PROX_BS") %>%
#  proximity(.,childcare.sf, "PROX_CHILDCARE") %>%
#  proximity(., eldercare.sf, "PROX_ELDERCARE") %>%
#  proximity(., foodarea.sf, "PROX_FOOD") %>%
#  proximity(., trainstation.sf, "PROX_MRT") %>%
#  proximity(., good_prischool.sf, "PROX_TOPPRISCH") %>%
#  proximity(., shoppingmalls.sf, "PROX_MALL") %>%
#  proximity(., supermarkets.sf, "PROX_SPRMKT") %>%
#  proximity(., hospital.sf, "PROX_HOST") 

#hdb_resale_test <- 
#  # the columns will be truncated later on when viewing 
#  # so we're limiting ourselves to two-character columns for ease of #viewing #etween
#  proximity(hdb_resale_test.sf, busstop.sf, "PROX_BS") %>%
#  proximity(.,childcare.sf, "PROX_CHILDCARE") %>%
#  proximity(., eldercare.sf, "PROX_ELDERCARE") %>%
#  proximity(., foodarea.sf, "PROX_FOOD") %>%
#  proximity(., trainstation.sf, "PROX_MRT") %>%
#  proximity(., good_prischool.sf, "PROX_TOPPRISCH") %>%
#  proximity(., shoppingmalls.sf, "PROX_MALL") %>%
#  proximity(., supermarkets.sf, "PROX_SPRMKT") %>%
#  proximity(., hospital.sf, "PROX_HOST") 

After we ran the proximity distance, lets run the number of radius as well.

#num_radius <- function(df1, df2, varname, radius) {
#  dist_matrix <- st_distance(df1, df2) %>%
#    drop_units() %>%
#    as.data.frame()
#  df1[,varname] <- rowSums(dist_matrix <= radius)
#  return(df1)
#}
#hdb_resale_train_final <- 
#  num_radius(hdb_resale_train.sf, foodarea.sf, "NUM_FOOD", 350) %>%
#  num_radius(., childcare.sf, "NUM_CHILDCARE", 350) %>%
#  num_radius(., busstop.sf, "NUM_BUS_STOP", 350) %>%
#  num_radius(., supermarkets.sf, "NUM_SPMKT", 350) %>%
#  num_radius(., primaryschool.sf, "NUM_SCHOOL", 1000)

#hdb_resale_test_final <- 
#  num_radius(hdb_resale_test, foodarea.sf, "NUM_FOOD", 350) %>%
#  num_radius(., childcare.sf, "NUM_CHILDCARE", 350) %>%
#  num_radius(., busstop.sf, "NUM_BUS_STOP", 350) %>%
#  num_radius(., supermarkets.sf, "NUM_SPMKT", 350) %>%
#  num_radius(., primaryschool.sf, "NUM_SCHOOL", 1000)
#st_write(hdb_resale_train_final, "data/exported/hdb_resale_train_final.shp")
#st_write(hdb_resale_test_final, "data/exported/hdb_resale_test_final.shp")
#write.csv(hdb_resale_train_final, "data/exported/hdb_resale_train.csv")
#write.csv(hdb_resale_test_final, "data/exported/hdb_resale_test.csv")

Read in the exported file

train_resale <- st_read(dsn = "data/exported", layer ="hdb_resale_train_final")
Reading layer `hdb_resale_train_final' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\exported' 
  using driver `ESRI Shapefile'
Simple feature collection with 23656 features and 31 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11519.79 ymin: 28217.39 xmax: 42645.18 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM
test_resale <- st_read(dsn = "data/exported", layer ="hdb_resale_test_final")
Reading layer `hdb_resale_test_final' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\exported' 
  using driver `ESRI Shapefile'
Simple feature collection with 1848 features and 31 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11655.33 ymin: 28287.8 xmax: 42444.75 ymax: 48675.05
Projected CRS: SVY21 / Singapore TM

Computing Correlation Matrix

Lets examine if there is a sign of multicolinearity.

notice that we wanna check the correlation for remaining lease as well. we can swap the column position.

train_resale<- train_resale[c(1,2,3,4,5,6,7,8,9,10,12,13,11,14:32)]
train_resale$rmnng_l <- as.numeric(as.character(train_resale$rmnng_l))
train_resale_nogeom <- train_resale %>%
  st_drop_geometry()
corrplot::corrplot(cor(train_resale_nogeom[,13:31]), 
                   diag = FALSE, 
                   order = "AOE",
                   tl.pos = "td", 
                   tl.cex = 0.5, 
                   method = "number", 
                   type = "upper")

In the correlation matrix, we can see that story_level_low AND story_level_middle has a negative correlation between each other. Negative correlation can deduce that when x increases, y decreases. This shows that when resident bought a higher level(decrease), the occupancy of lower level is high(increases). Hence, they show a negative correlation.

we can see that NUM_SPM AND PROX_SP has a negative correlation between each other. Negative correlation can deduce that when x increases, y decreases. This shows that when a shopping mall distance is further (increase), the number of shopping mall resident would go is lower (decreases). Hence, they show a negative correlation.

Even though the two variables may have a moderate negative correlation, this does not necessarily imply that the behavior of one has any causal influence on the other. Hence, I decided to keep them.

Building a non-spatial multiple linear regression

set.seed(99)
resale_train_mlr <- lm(rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + stry_lvl_m + stry_lvl_h + stry_lvl_v +
                  PROX_BS + PROX_CH + PROX_EL +
                  PROX_FO + PROX_MR + PROX_TO + 
                  PROX_MA + PROX_SP + PROX_HO + NUM_FOO +
                  NUM_CHI + NUM_BUS +
                  NUM_SPM + NUM_SCH,
                data=train_resale)
summary(resale_train_mlr)

Call:
lm(formula = rsl_prc ~ flr_r_s + rmnng_l + stry_lvl_l + stry_lvl_m + 
    stry_lvl_h + stry_lvl_v + PROX_BS + PROX_CH + PROX_EL + PROX_FO + 
    PROX_MR + PROX_TO + PROX_MA + PROX_SP + PROX_HO + NUM_FOO + 
    NUM_CHI + NUM_BUS + NUM_SPM + NUM_SCH, data = train_resale)

Residuals:
    Min      1Q  Median      3Q     Max 
-352809  -52221   -5835   47721  455069 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.828e+05  1.011e+04  27.975  < 2e-16 ***
flr_r_s      2.346e+03  7.884e+01  29.756  < 2e-16 ***
rmnng_l      4.457e+03  4.632e+01  96.226  < 2e-16 ***
stry_lvl_l  -2.064e+05  2.987e+03 -69.085  < 2e-16 ***
stry_lvl_m  -1.798e+05  2.962e+03 -60.715  < 2e-16 ***
stry_lvl_h  -1.505e+05  2.986e+03 -50.410  < 2e-16 ***
stry_lvl_v          NA         NA      NA       NA    
PROX_BS     -1.409e+01  9.991e+00  -1.410 0.158526    
PROX_CH      2.501e+01  6.921e+00   3.613 0.000303 ***
PROX_EL     -2.326e+01  9.092e-01 -25.584  < 2e-16 ***
PROX_FO     -7.040e+01  4.673e+00 -15.067  < 2e-16 ***
PROX_MR     -1.903e+01  1.345e+00 -14.150  < 2e-16 ***
PROX_TO     -1.879e+01  2.960e-01 -63.493  < 2e-16 ***
PROX_MA      2.402e+01  1.720e+00  13.968  < 2e-16 ***
PROX_SP     -1.769e+01  4.708e+00  -3.757 0.000172 ***
PROX_HO      7.465e-01  4.329e-01   1.724 0.084672 .  
NUM_FOO      1.704e+03  3.807e+01  44.758  < 2e-16 ***
NUM_CHI     -3.734e+03  2.962e+02 -12.608  < 2e-16 ***
NUM_BUS     -7.465e+02  1.998e+02  -3.736 0.000188 ***
NUM_SPM      1.557e+03  6.302e+02   2.471 0.013472 *  
NUM_SCH     -8.290e+03  3.987e+02 -20.791  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 78910 on 23636 degrees of freedom
Multiple R-squared:  0.629, Adjusted R-squared:  0.6287 
F-statistic:  2109 on 19 and 23636 DF,  p-value: < 2.2e-16

The R-squared of 0.629 reveals that the simple regression model built is able to explain about 63% of the resale prices.

Since p-value is much smaller than 0.0001, we will reject the null hypothesis that mean is a good estimator of resale price. This will allow us to infer that simple linear regression model above is a good estimator of resale price.

Looking at p-value <0.05, we can see that not all the independent variables are statistically significant, and said variables should be removed. The variables I’ve identified as insignificant are surprisingly, PROX_BS and PROX_HO. We will revised the model by removing those variables which are not statistically significant and are ready to calibrate the revised model by using the code chunk below.

resale_train_mlr_revised <- lm(rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + stry_lvl_m + stry_lvl_h  +
                  PROX_CH + PROX_EL +
                  PROX_FO + PROX_MR + PROX_TO + 
                  PROX_MA + PROX_SP + NUM_FOO +
                  NUM_CHI + NUM_BUS +
                  NUM_SPM + NUM_SCH,
                data=train_resale)
ols_regress(resale_train_mlr_revised)
                            Model Summary                              
----------------------------------------------------------------------
R                       0.793       RMSE                    78916.388 
R-Squared               0.629       Coef. Var                  15.000 
Adj. R-Squared          0.629       MSE                6227796316.198 
Pred R-Squared          0.628       MAE                     61293.737 
----------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                     ANOVA                                       
--------------------------------------------------------------------------------
                    Sum of                                                      
                   Squares           DF       Mean Square       F          Sig. 
--------------------------------------------------------------------------------
Regression    2.495146e+14           17      1.467733e+13    2356.745    0.0000 
Residual      1.472126e+14        23638    6227796316.198                       
Total         3.967272e+14        23655                                         
--------------------------------------------------------------------------------

                                          Parameter Estimates                                            
--------------------------------------------------------------------------------------------------------
      model           Beta    Std. Error    Std. Beta       t        Sig           lower          upper 
--------------------------------------------------------------------------------------------------------
(Intercept)     281742.067     10015.130                  28.132    0.000     262111.767     301372.367 
    flr_r_s       2360.038        78.360        0.125     30.118    0.000       2206.447       2513.628 
    rmnng_l       4444.765        45.883        0.470     96.872    0.000       4354.832       4534.698 
 stry_lvl_l    -206633.466      2984.270       -0.773    -69.241    0.000    -212482.828    -200784.104 
 stry_lvl_m    -180118.191      2958.808       -0.678    -60.875    0.000    -185917.646    -174318.736 
 stry_lvl_h    -150784.027      2984.073       -0.463    -50.530    0.000    -156633.001    -144935.052 
    PROX_CH         24.233         6.901        0.015      3.512    0.000         10.707         37.759 
    PROX_EL        -22.906         0.890       -0.109    -25.727    0.000        -24.651        -21.161 
    PROX_FO        -70.566         4.667       -0.069    -15.119    0.000        -79.714        -61.418 
    PROX_MR        -18.937         1.344       -0.062    -14.089    0.000        -21.572        -16.303 
    PROX_TO        -18.584         0.273       -0.335    -68.081    0.000        -19.119        -18.049 
    PROX_MA         23.417         1.687        0.066     13.884    0.000         20.111         26.722 
    PROX_SP        -17.711         4.704       -0.021     -3.765    0.000        -26.931         -8.490 
    NUM_FOO       1690.104        37.357        0.207     45.242    0.000       1616.883       1763.326 
    NUM_CHI      -3746.632       296.140       -0.057    -12.652    0.000      -4327.085      -3166.180 
    NUM_BUS       -632.164       188.244       -0.014     -3.358    0.001      -1001.136       -263.193 
    NUM_SPM       1563.971       629.696        0.013      2.484    0.013        329.727       2798.215 
    NUM_SCH      -8479.605       388.360       -0.105    -21.834    0.000      -9240.815      -7718.395 
--------------------------------------------------------------------------------------------------------
train_data_sp <- as_Spatial(train_resale)
train_data_sp
class       : SpatialPointsDataFrame 
features    : 23656 
extent      : 11519.79, 42645.18, 28217.39, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 31
names       : X___1,   month,       town, flt_typ, block,       strt_nm,  stry_rn, flr_r_s,       flt_mdl, ls_cmm_, rsl_prc,          address, rmnng_l, stry_lvl_l, stry_lvl_m, ... 
min values  :     1, 2021-01, ANG MO KIO,  4 ROOM,     1,  ADMIRALTY DR, 01 TO 03,      70, Adjoined flat,    1967,  250000,   1 CHAI CHEE RD,    44.5,          0,          0, ... 
max values  : 23656, 2022-12,     YISHUN,  4 ROOM,    9B, YUNG SHENG RD, 49 TO 51,     145,       Type S1,    2019, 1370000, 9B BOON TIONG RD,   97.33,          1,          1, ... 

The weights have a very large influence on the parameter estimation of the geographically weighted regression (GWR). The weights show the relationship between observations or locations in the model. Types of weights that are often used in GWR are Gaussian kernels. This weighting can also be arranged into two forms. There are the fixed Gaussian kernel and the adaptive Gaussian kernel. Fixed is used when each location has the same bandwidth value. Adaptive is used when each location has a different bandwidth value. Adaptive is appropriate if points are irregularly spread – it ensures that there are enough points to calibrate the regression. In this take home, I’ve decided to use adaptive

#bw_adaptive <- bw.gwr(rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + stry_lvl_m + #stry_lvl_h  +
#                  PROX_CH + PROX_EL +
#                  PROX_FO + PROX_MR + PROX_TO + 
#                  PROX_MA + PROX_SP + NUM_FOO +
#                  NUM_CHI + NUM_BUS +
#                  NUM_SPM + NUM_SCH,
#                  data=train_data_sp,
#                  approach="CV",
#                  kernel="gaussian",
#                  adaptive=TRUE,
#                  longlat=FALSE)

#saveRDS(bw_adaptive, "bwadaptive.rds")

Read in RDS

bw_adaptive <- read_rds("bwadaptive.rds")

Now, we can go ahead to calibrate the gwr-based hedonic pricing model by using adaptive bandwidth and Gaussian kernel as shown in the code chunk below.

#gwr_adaptive <- gwr.basic(formula = rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + #stry_lvl_m + stry_lvl_h  +
#                  PROX_CH + PROX_EL +
#                  PROX_FO + PROX_MR + PROX_TO + 
#                  PROX_MA + PROX_SP + NUM_FOO +
#                  NUM_CHI + NUM_BUS +
#                  NUM_SPM + NUM_SCH,
#                         data=train_data_sp,
#                          bw=bw_adaptive, 
#                          kernel = 'gaussian', 
#                          adaptive=TRUE,
#                          longlat = FALSE)
#gwr_adaptive
#saveRDS(gwr_adaptive, "gwradaptive.rds")
gwr_adaptive <- read_rds("gwradaptive.rds")

Preparing Coordinates data

coords_train <- st_coordinates(train_resale)
coords_test <- st_coordinates(test_resale)
#write_rds(coords_train, "coords_train.rds" )
#write_rds(coords_test, "coords_test.rds" )

Dropping geometry field

train_data <- train_resale %>% 
  st_drop_geometry()

Calibrating Random Forest Model

set.seed(999)
rf <- ranger(rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + stry_lvl_m + stry_lvl_h  +
                  PROX_CH + PROX_EL +
                  PROX_FO + PROX_MR + PROX_TO + 
                  PROX_MA + PROX_SP + NUM_FOO +
                  NUM_CHI + NUM_BUS +
                  NUM_SPM + NUM_SCH,
             data=train_data)
print(rf)
Ranger result

Call:
 ranger(rsl_prc ~ flr_r_s + rmnng_l + stry_lvl_l + stry_lvl_m +      stry_lvl_h + PROX_CH + PROX_EL + PROX_FO + PROX_MR + PROX_TO +      PROX_MA + PROX_SP + NUM_FOO + NUM_CHI + NUM_BUS + NUM_SPM +      NUM_SCH, data = train_data) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      23656 
Number of independent variables:  17 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       1276057265 
R squared (OOB):                  0.9239146 

Calibrating Geographical Random Forest Model

The code chunk below calibrate a geographic ranform forest model by using grf() of SpatialML package. Previously, we have calculated the adjusted bandwidth, we will use that value for our grf() function. We decided to use adaptive and should be consistent for the kernel

#set.seed(99)
#gwRF_adaptive <- grf(formula = rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + stry_lvl_m #+ stry_lvl_h  +
#                  PROX_CH + PROX_EL +
#                  PROX_FO + PROX_MR + PROX_TO + 
#                  PROX_MA + PROX_SP + NUM_FOO +
#                  NUM_CHI + NUM_BUS +
#                  NUM_SPM + NUM_SCH,
#                     dframe=train_data,
#                      ntree=50,
#                     bw=bw_adaptive,
#                     kernel="adaptive",
#                     coords=coords_train
#                )
#write_rds(gwRF_adaptive, "gwRF_adaptive_50trees.rds")
gwRF_adaptive <- read_rds("gwRF_adaptive_50trees.rds")

Predicting by using test data

test_data <- cbind(test_resale, coords_test) %>%
  st_drop_geometry()
#gwRF_pred <- predict.grf(gwRF_adaptive, 
#                           test_data, 
#                           x.var.name="X",
#                           y.var.name="Y", 
#                           local.w=1,
#                           global.w=0)
#write_rds(gwRF_pred, "GRF_pred.rds")

Converting the predicting output into a data frame

GRF_pred <- read_rds("GRF_pred.rds")
GRF_pred_df <- as.data.frame(GRF_pred)
test_data_p <- cbind(test_data, GRF_pred_df)

Calculating Root Mean Square Error

rmse(test_data_p$rsl_prc, 
     test_data_p$GRF_pred)
[1] 105026.9

Visualizing the predicted values

ggplot(data = test_data_p,
       aes(x = GRF_pred,
           y = rsl_prc)) +
  geom_point()